Private Record Linkage: Comparison of Selected Techniques for Name Matching
Authors: Pawel Grzebala
Abstract
Grzebala, Pawel. M.S.C.E. Department of Computer Science and Engineering, Wright State University, 2016. Private Record Linkage: A Comparison of Selected Techniques for Name Matching. The rise of Big Data Analytics has shown the utility of analyzing all aspects of a problem by bringing together disparate data sets. Efficient and accurate private record linkage algorithms are necessary to achieve this. However, records are often linked based on personally identifiable information, and protecting the privacy of individuals is critical. This work contributes to the field by studying an important component of the private record linkage problem: linking based on names while keeping those names encrypted, both on disk and in memory. We explore the applicability, accuracy, speed, and security of three primary approaches to this problem (along with several variations) and compare the results to common name-matching metrics on unprotected data. While these approaches are not new, this work provides a thorough analysis on a range of datasets containing systematically introduced flaws common to name-based data entry, such as typographical errors, optical character recognition errors, and phonetic errors. Additionally, we evaluate the privacy level of the q-gram-based metrics by simulating the frequency analysis attack that could follow a data breach. We show that, for the use case we are considering, the best choice of string metric is a padded q-gram-based metric, which can provide high record linkage accuracy and is resilient to frequency analysis attacks under certain conditions.
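To make the idea of a padded q-gram metric concrete, the following is a minimal sketch rather than the thesis's actual implementation. It assumes bigrams (q = 2), `#` as the padding character, the Dice coefficient as the set similarity, and HMAC-SHA256 as a stand-in for whatever protection scheme is applied to the q-grams; the function names `padded_qgrams`, `keyed_tokens`, and `dice` are hypothetical.

```python
import hashlib
import hmac


def padded_qgrams(name, q=2, pad="#"):
    """Return the set of q-grams of `name`, padded with q-1 pad characters on each end."""
    padded = pad * (q - 1) + name.lower() + pad * (q - 1)
    return {padded[i:i + q] for i in range(len(padded) - q + 1)}


def keyed_tokens(qgrams, key):
    """Replace each q-gram with a keyed hash (assumption: HMAC-SHA256 as the protection step)."""
    return {
        hmac.new(key, g.encode("utf-8"), hashlib.sha256).hexdigest()
        for g in qgrams
    }


def dice(a, b):
    """Dice coefficient of two sets: 2|A ∩ B| / (|A| + |B|)."""
    if not a and not b:
        return 1.0
    return 2 * len(a & b) / (len(a) + len(b))


key = b"shared-secret"  # hypothetical key agreed on by the linking parties
plain = dice(padded_qgrams("smith"), padded_qgrams("smyth"))
protected = dice(
    keyed_tokens(padded_qgrams("smith"), key),
    keyed_tokens(padded_qgrams("smyth"), key),
)
print(plain, protected)
```

Because equal q-grams map to equal tokens under the same key, the similarity computed on the protected tokens matches the plaintext one. That determinism is also what makes a frequency analysis attack conceivable: a common q-gram always produces the same token, so token frequencies mirror q-gram frequencies.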
Similar resources
An Empirical Comparison of Approaches to Approximate String Matching in Private Record Linkage
Due to the frequency of spelling and typographical errors in practical applications, record linkage algorithms have to use string similarity functions. In many legal contexts, identifiers such as names have to be encrypted before a record linkage can be attempted. Therefore, algorithms for computing string similarity functions with encrypted identifiers are essential for approximating string ma...
Scaling Private Record Linkage using Output Constrained Differential Privacy
Many scenarios require computing the join of databases held by two or more parties that do not trust one another. Private record linkage is a cryptographic tool that allows such a join to be computed without leaking any information about records that do not participate in the join output. However, such strong security comes with a cost: except for exact equi-joins, these techniques have a high ...
Learning Blocking Schemes for Record Linkage
Record linkage is the process of matching records across data sets that refer to the same entity. One issue within record linkage is determining which record pairs to consider, since a detailed comparison between all of the records is impractical. Blocking addresses this issue by generating candidate matches as a preprocessing step for record linkage. For example, in a person matching problem, ...
Tree Based Scalable Indexing for Multi-Party Privacy-Preserving Record Linkage
Recently, the linking of multiple databases to identify common sets of records has gained increasing recognition in application areas such as banking, health, insurance, etc. Often the databases to be linked contain sensitive information, where the owners of the databases do not want to share any details with any other party due to privacy concerns. The linkage of records in different databases...
Real World Performance of Approximate String Comparators for use in Patient Matching
Medical record linkage is becoming increasingly important as clinical data is distributed across independent sources. To improve linkage accuracy we studied different name comparison methods that establish agreement or disagreement between corresponding names. In addition to exact raw name matching and exact phonetic name matching, we tested three approximate string comparators. The approximate...